In this project, I focus on the white wine data set provided by Udacity. It contains 4898 observations of Portuguese “Vinho Verde” white wines. Eleven characteristics were recorded for each wine from physical measurements. Additionally, an estimate of the quality of each wine, made by a set of experts, is available. Together with a column that assigns consecutive numbers to the observations, this gives 13 variables in our dataset:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The starting point of our analysis is to inspect each variable and its distribution. All the physical measurements are floating-point numbers, while the quality of the wines is given as an integer. The summary statistics of these variables are:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
This table gives a general idea of the range of the values and their quartiles. Note that the data set is clean and has no missing values. Representing the data as histograms should give a clearer picture:
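Before turning to the histograms, the absence of missing values can be checked with a short sketch like the one below. The data frame name `wines_head` is hypothetical, and it holds only the first five rows of three columns copied from the `str()` output above. On the full data set the same call would be applied to the loaded data frame.

```r
# First five rows of three columns, copied from the str() output above.
# `wines_head` is an illustrative name, not the author's variable.
wines_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Count missing values per column; all zeros means the columns are complete.
missing_per_column <- colSums(is.na(wines_head))
print(missing_per_column)
```

`summary()` also adds an `NA's` row for any column that contains missing values, so the absence of that row in the table above points to the same conclusion.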
These graphs show the univariate distributions. As expected, ‘quality’ is the only variable with just a few discrete levels (alcohol, for example, is also recorded at discrete values, but it takes 103 distinct ones).
The ‘quality’ distribution looks normal, with few outliers (very bad wines marked with 3 and very good ones marked with 9). The large majority of wines are of “normal” quality, which leads to an imbalanced data set when it comes to training estimators.
Some of the distributions look skewed - I will focus on ‘residual.sugar’:
Plotting the density of the residual sugar on a logarithmic x scale reveals a multimodal distribution with a clear cut point around 3.5 g/L (the cut point between dry and medium-dry wines is usually around 4 g/L) and a second one around 10 g/L (the cut point between medium-dry and medium wines is usually around 12 g/L). The Wikipedia page on the sweetness of wine [https://en.wikipedia.org/wiki/Sweetness_of_wine] mentions that it is very rare to find wines with residual sugar values below 1 g/L. This is confirmed by our data set:
## [1] "percentage of wines with low residual sugar: 1.57207023274806%"
The same Wikipedia page mentions that, as a general rule of thumb, wines with residual sugar over 45 g/L are considered sweet. Our data set is heavily biased from this perspective; the number of entries in each taste category is:
##
## dry medium-dry medium sweet
## 2097 1975 825 1
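A minimal sketch of how such a taste variable could be derived with `cut()` is shown below. The cut points 4, 12 and 45 g/L are the ones quoted in the text; which class a value exactly on a cut point falls into is an assumption, and the variable names are illustrative. The input is only the first ten `residual.sugar` values from the `str()` output, so the counts differ from the full table above.

```r
# Cut points (g/L) quoted in the text: 4, 12 and 45.
# Assigning boundary values to the lower class is an assumption.
taste_breaks <- c(0, 4, 12, 45, Inf)
taste_labels <- c("dry", "medium-dry", "medium", "sweet")

# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# With right = TRUE a value equal to a cut point goes to the lower class.
taste <- cut(residual_sugar, breaks = taste_breaks, labels = taste_labels,
             right = TRUE)

# table() keeps empty levels, so "sweet" is reported with a count of 0.
taste_counts <- table(taste)
print(taste_counts)
```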
This data set has only one sweet wine among almost 5000 entries! Either the region and producers from which the data was collected specialize in dry wines, or there might be a systematic error in the measurement of residual sugar.
We confirm that the three peaks we see in the data correspond to the rough categories described on the Wikipedia page (the cut points suggested by Wikipedia being 4, 12 and 45 g/L).
Using a logarithmic scale makes some of the other distributions look closer to normal, without revealing any multimodality:
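The effect of a logarithmic scale on a right-skewed variable can be illustrated with a small sketch. The `skewness` helper and the input values are illustrative assumptions and are not taken from the wine data: powers of ten are strongly right-skewed, but become evenly spaced, and therefore symmetric, after a `log10` transform.

```r
# Sample skewness: third central moment divided by the 1.5th power of the
# second central moment (no small-sample correction).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative values only (not from the wine data).
x <- 10^(0:4)

skew_raw <- skewness(x)         # positive: long right tail
skew_log <- skewness(log10(x))  # close to zero: symmetric after the transform
print(c(raw = skew_raw, log = skew_log))
```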
The graph of the alcohol has an interesting shape:
According to this article [http://www.realsimple.com/holidays-entertaining/entertaining/food-drink/alcohol-content-wine], wine can be classified into four categories by alcohol content, with the cut points at 12.5%, 13.5% and 14.5%:
##
## very-low moderately-low high very-high
## 4543 326 29 0
The data set exhibits a clear bias towards very low alcohol wines. This was to be expected, as Portuguese vinho verde is a typical very low alcohol wine.
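The size of this bias can be made explicit by turning the counts from the table above into percentages. The sketch below only re-uses the four printed counts; the variable names are illustrative.

```r
# Counts per alcohol category, copied from the table above.
alcohol_counts <- c("very-low" = 4543, "moderately-low" = 326,
                    "high" = 29, "very-high" = 0)

# The counts should add up to the 4898 observations in the data set.
total <- sum(alcohol_counts)

# Share of each category in percent.
alcohol_share <- 100 * alcohol_counts / total
print(round(alcohol_share, 1))
```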
As a final step, I investigate the distribution of the outliers in the heavily skewed distributions:
These new graphs confirm that using logarithmic scales for chlorides and free.sulfur.dioxide is a sensible choice. Their logarithmic representations (see a previous graph) more closely resemble normal distributions and exhibit few outliers.
Based on the observations above, I extended the original data set with several new columns:
I started by exploring the pairwise correlations between all the variables:
Most of the correlation coefficients in the above image are close to zero, with a few notable exceptions. A better look at the correlations is provided by:
This graph is intended as an aid for training classifiers. I have color- and size-coded the correlations between the variables in the original data set. Several observations can be made:
I will try to get a better understanding of how quality is related to other variables:
With the possible exception of alcohol (higher values correlate with higher quality), no clear relationship with quality stands out from these graphs. This is consistent with the weak correlations found between quality and the other variables (volatile.acidity: -0.195, density: -0.3, alcohol: 0.436, log.chlorides: 0.21).
Alcohol seems to be correlated with most of the variables in the dataset. It exhibits the largest correlation coefficients with the other variables (density: -0.78, residual.sugar: -0.451, total.sulfur.dioxide: -0.449, log.chlorides: -0.36).
Density correlates strongly with residual.sugar (0.839) and alcohol (-0.78):
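Coefficients like these can be obtained with `cor()`. The sketch below is only an illustration on the first five rows shown in the `str()` output (the name `wines_head` is hypothetical), so it reproduces the signs of the two correlations but not the values 0.839 and -0.78, which come from the full data set.

```r
# First five rows of three columns, copied from the str() output above.
wines_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation matrix of all column pairs.
cor_matrix <- cor(wines_head, method = "pearson")
print(round(cor_matrix, 2))
```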
As shown in the previous section, density correlates with residual.sugar. In the next graph, I use color coding to explore whether quality is also affected by this relation. For ease of visualization, I use a quantized version of quality with three levels:
Let’s see how the relationship holds for each of the three classes:
Even at this large magnification of the y axis, a linear relationship is obvious. Additionally, quality also seems to be directly linked to the spread of values on the x axis.
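One way to look at the relationship per class is to fit a separate linear model for each quality level. The following is a minimal sketch on synthetic, noise-free data rather than the wine data. The three-level cut points (up to 5, exactly 6, 7 and above) are an assumption, since the text does not state them, and all names are illustrative.

```r
# Synthetic, noise-free illustration (not the wine data): density rises
# linearly with residual sugar, with a lower intercept for better wines.
sugar   <- rep(c(2, 6, 10, 14), times = 3)
quality <- rep(c(5, 6, 7), each = 4)
dens    <- 0.990 + 0.0004 * sugar - 0.0005 * (quality - 5)

# Hypothetical three-level quantization of quality; the cut points are an
# assumption made for this sketch.
quality_class <- cut(quality, breaks = c(-Inf, 5, 6, Inf),
                     labels = c("poor", "normal", "good"))

toy <- data.frame(sugar, dens, quality_class)

# Fit one simple linear regression per class and collect the coefficients.
fits <- lapply(split(toy, toy$quality_class),
               function(d) coef(lm(dens ~ sugar, data = d)))
coef_table <- do.call(rbind, fits)
print(coef_table)
```

With real data, comparing the slope and intercept rows of `coef_table` shows whether the classes share one trend line or are shifted against each other.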
The volatile.acidity adds little information - the spread on the x axis is reduced for good-quality wines and is largest for the poor wines. The relationship between residual.sugar and density is strongly visible even for the quantized version of density.
total.sulfur.dioxide is correlated with log.free.sulfur.dioxide, as can be seen in the following graphs. Superimposing the density shows a clear correlation only with total.sulfur.dioxide. I added a linear fit for each of the classes:
As a final step in this section, I am interested in which features of the dataset are important from a classification perspective. As the correlations with the target variable are small, I expect different classifiers to rank the features differently. The graphs below show the normalized feature importances for linear regression and random forest. As the goal of this project is not to find the best classifier available, I left all the parameters at their default values. Cross-validation was used to estimate the performance of the models.
As expected, the two sets of features are ordered differently. Interestingly, the random forest classifier is slightly better than the SVM classifier proposed in the original paper (relative improvement of ~10%):
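The general procedure can be sketched in base R for the linear model only. This is not the code behind the graphs: the data are synthetic, the target is a made-up linear combination, the importance measure (normalized absolute coefficients) and the 5-fold split are assumptions, and the random forest part is left out because it needs an additional package.

```r
set.seed(1)  # reproducible synthetic data, not the wine data

n <- 200
toy <- data.frame(alcohol = rnorm(n), density = rnorm(n), sulphates = rnorm(n))
# Hypothetical target: depends strongly on alcohol, weakly on density.
toy$quality <- 1.0 * toy$alcohol - 0.3 * toy$density + rnorm(n, sd = 0.5)

# The synthetic predictors share one scale; real predictors would be
# standardized (e.g. with scale()) before comparing coefficients.
fit <- lm(quality ~ ., data = toy)
importance <- abs(coef(fit)[-1])             # drop the intercept
importance <- importance / sum(importance)   # normalize to sum to 1

# 5-fold cross-validation of the root mean squared error.
k <- 5
fold <- sample(rep(1:k, length.out = n))
rmse <- sapply(1:k, function(i) {
  model <- lm(quality ~ ., data = toy[fold != i, ])
  pred  <- predict(model, newdata = toy[fold == i, ])
  sqrt(mean((toy$quality[fold == i] - pred)^2))
})

print(round(importance, 3))
print(mean(rmse))
```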
This plot showcases the histogram and the scaled distribution of the residual sugar variable, on a logarithmic scale. Several things can be noticed:
This plot shows the distribution of alcohol values across the quality levels. The graph supports the general opinion that wines with a higher alcohol content are perceived as being of higher quality.
The graph indirectly shows the large variance in the distribution of the samples across the classes, where for quality 3 and 9 very few samples are available. It also shows that the correlation between alcohol and quality is moderate - a large number of outliers is present in the figure and almost all quality bins span across the whole alcohol range.
In this plot I show the strong correlation between density and residual sugar. Although the graph shows quite some variation, the y axis does not start at 0 - starting it at 0 would result in an almost straight line, making this point stronger. By dividing the wine quality into three categories and fitting a linear regression model for each class, we notice a clear ordering of the point clusters from the three classes, with better wines having a lower density and less residual sugar.
In this project I have analyzed the white wine dataset provided by Udacity. Visualizing the variables in the dataset helped a lot in understanding the information hidden within.
My major struggle was to find a direct correlation between the quality of the wine and the given variables. Looking at the correlation plots, it seems that the quality is an insignificant variable. My first instinct was to dismiss it with an excuse such as: “tasting wines is as subjective as any human action can be”. Then I paid some attention to how the data was collected: the opinions of three reviewers are averaged. So, the actual human observations are modified by two quantization steps (the original marks given by the reviewers and the conversion of the average to an integer mark) and one averaging step (which removes samples from the border categories). Once this was clear, I understood why the authors of the paper were forced to further average the quality marks in order to boost the performance of their classifier.
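The effect of the averaging step on the border categories can be illustrated by enumerating every possible combination of three marks. The sketch assumes a 0 to 10 integer scale and treats all combinations as equally likely. Both are simplifying assumptions made for illustration, not properties of the real reviewers.

```r
# All 11^3 combinations of three integer marks on an assumed 0-10 scale.
marks <- expand.grid(r1 = 0:10, r2 = 0:10, r3 = 0:10)

# Average the three marks and round back to an integer mark.
final_mark <- round(rowMeans(marks))

# Number of combinations that end up in each final mark.
mark_counts <- table(factor(final_mark, levels = 0:10))
print(mark_counts)
```

Under these assumptions a single reviewer gives each mark in 121 of the 1331 combinations, while the rounded average reaches the extreme marks far less often, so the border categories are thinned out.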
The other major struggle was to find combinations of two variables that show clear clusters of quality-related data. This struggle was not fully resolved - although some clustering can be found in the graphs in the multivariate section, clear distinctions between the classes are not present. Peeking again at the original paper, it seems that the authors also failed to identify any.
General qualitative observations from Wikipedia state that, allowing for exceptions, superior wines have a high percentage of alcohol and low levels of sugar. This observation matches the data to some extent. It is also noted that the perceived sweetness can be modified, for example, by acidity (I also wonder about temperature, given that we are dealing with white wine). All in all, some chemistry knowledge would probably help in creating a better variable for perceived sweetness. As future work, I would recommend searching the literature on how the chemicals in the data set modify perceived sweetness and building a variable based on the findings. I would be astonished if the popular knowledge were not verified by the data.